Individuals, and particularly corporations, often generate a large amount of documentation. Such documentation may include personnel information, technical information, financial information and the like. Such documents may be used intensively whilst they have direct current relevance to the individual or corporation.
However, when a project is completed or abandoned, the documents relating to that project cease to be of direct current relevance. It is common practice for such documents to be archived; for example, the documents may be collected together, boxed, and put into storage. Typically, the documents are stored in a location that is more remote or less convenient than the location in which they were kept whilst they were considered to be current. This is advantageous, as it frees up the valuable office space in which the documents were previously held. It may also have economic benefits, as the storage location is generally less expensive. Also, documents may be stored more densely at a storage location because it is specifically adapted for the storage of documents; for example, the storage location may include closely spaced floor-to-ceiling racks within a warehouse.
When documents are current they are typically stored in files that hold a plurality of documents which have some relationship with each other. For example, documents relating to a particular project, or aspect of a project, may be held in the same file. By way of a particular example, records relating to the employment of a particular person may be held on a single file, such records including documents from their initial application for the job (e.g. curriculum vitae, covering letter, evidence of qualification, and a record of the interviews conducted), a copy of the contract of employment, details of regular appraisals performed during employment, details of salary and changes to salary, disciplinary matters, and documents relating to the termination of employment.
For the sake of convenient handling, a plurality of files to be archived are often accumulated in a single container or box, and the container is sent to storage when it is full, or when other criteria are met. The accumulated items can be considered to be a collection of items. Typically the items in the box are related in some way. For example, all the items may have been issued by a single person or corporate department. A record of the names (file references) of the files stored in the container is typically made, so that, subsequently, the box in which a particular file is stored can be identified in the storage location, and the file retrieved, provided that the file reference is known.
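By way of illustration only, the record described above can be thought of as a simple lookup from file reference to container. The following Python sketch is an assumption made for explanatory purposes and does not represent any particular archiving system; the names `ContainerIndex`, `add_file` and `locate`, and the example references, are hypothetical.

```python
from typing import Dict, Optional


class ContainerIndex:
    """Hypothetical record mapping each file reference to the box holding it."""

    def __init__(self) -> None:
        self._box_by_file: Dict[str, str] = {}

    def add_file(self, file_ref: str, box_id: str) -> None:
        # Record which container a file was placed in when it was archived.
        self._box_by_file[file_ref] = box_id

    def locate(self, file_ref: str) -> Optional[str]:
        # Retrieval succeeds only when the exact file reference is known.
        return self._box_by_file.get(file_ref)


index = ContainerIndex()
index.add_file("HR/2001/0042", "BOX-17")
index.add_file("HR/2001/0043", "BOX-17")

print(index.locate("HR/2001/0042"))         # BOX-17
print(index.locate("logistics contracts"))  # None
```

A record of this kind supports retrieval by exact file reference only; it carries no information about the subject matter of the files in each container.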
The storage of documents incurs considerable expense. A corporation may accumulate an increasing quantity of archived documents over a period of time. Deciding when to destroy archived documents is conventionally very problematic. Often the personnel originally responsible for a document may have left the corporation or may have been assigned to other duties, so that there may be no person with a direct interest in the document. However, it may be necessary to retain the documents for legal or business reasons. Therefore, there is a tendency for documents to be retained once they have been archived, leading to a steady increase in the requirement for document storage over a period of time, and to associated increasing costs.
A significant problem with archived documents is that the documents may be relevant to current or future (unknown) legal proceedings, where the documents must be disclosed as part of the “discovery” procedure. Briefly, such a discovery procedure is typically the pre-trial disclosure of documents containing information relevant to an opposing party in a legal action. Penalties for destroying relevant documents can be severe, even if this is done inadvertently.
When legal action is initiated, conventionally a responsible lawyer will define a Document Preservation Notice (DPN) which explains the subject-matter of the litigation. This may be a narrative. The DPN is provided to paralegals or lawyers who manually read through documents (including archived documents), and who then identify documents relevant to the litigation based on their review of the documents and the DPN. Such a procedure is time-consuming and expensive. If more than one DPN is in force, it is difficult for a person to search for documents relevant to both DPNs simultaneously without becoming confused. Also, if the terms of a DPN are changed (e.g. because a new aspect is added to the litigation), the documents must all be manually reviewed again.
Also, after documents are archived, it may be difficult to identify those relating to a particular general subject area, whether for the purposes of discovery or otherwise, unless every archived document is reviewed. Such a difficulty arises because, when documents are archived, they are not catalogued in a consistent manner (or at all), making it difficult to identify a file unless the file reference is known. Each individual or department may catalogue documents according to different rules and to a different level of detail. For example, if litigation occurs that relates to subject matter spanning different departments (for example personnel/HR and logistics), a different search strategy may be required to search the files originating from each of those departments, as they will have been catalogued in very different ways.
One known method for cataloguing documents, so that they can be searched and identified subsequently, is to use Optical Character Recognition (OCR). According to this method, each page of a document is scanned before storage, and OCR is used to digitise the text so that it is searchable by word. However, OCR is a time-consuming and expensive process, as every page of each document needs to be scanned and converted. Also, some items that go into storage, such as drawings and hardware, cannot be scanned in this way, and such items would have to be categorised by a different method. Further, OCR does not provide any indication of the context in which a word is used. Searching text for particular words tends to produce a large number of irrelevant “hits” because there is no way of excluding documents where the context in which the word is used makes it irrelevant.
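The limitation of word-level searching described above can be illustrated with a minimal Python sketch. It is an assumption made for explanatory purposes only: the function `keyword_hits`, the document identifiers and the sample text are hypothetical, and the text is taken to have already been digitised by OCR.

```python
import re
from typing import Dict, List


def keyword_hits(pages: Dict[str, str], word: str) -> List[str]:
    """Return the IDs of documents whose text contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [doc_id for doc_id, text in pages.items() if pattern.search(text)]


# Hypothetical OCR output for two archived documents.
ocr_text = {
    "doc-1": "The crane failed during unloading at the dock.",
    "doc-2": "A paper crane was folded for the office party.",
}

# Both documents match, although only the first concerns lifting equipment.
print(keyword_hits(ocr_text, "crane"))  # ['doc-1', 'doc-2']
```

The search has no means of distinguishing the two uses of the word, so the second document is returned as an irrelevant hit.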
Accordingly, it would be desirable to provide an apparatus, method and computer program for archiving documents such that documents that have ongoing importance can be readily identified, documents that need to be retained for legal reasons are preserved and documents that are no longer required and do not need to be retained for any other reason can be disposed of.